• This project tackles the challenge of reducing harmful AI content across multiple languages, using translation to extend safety measures to languages where direct safety data is lacking (a rough sketch of the general idea follows below).

    High Impact
    Monday, March 11, 2024
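
    The item above gives only the general idea, so here is a minimal, hypothetical sketch of translation-backed safety filtering: route text from a language with little safety data through a high-resource language where a classifier exists. The helper functions and the keyword heuristic are illustrative stand-ins, not the project's actual code.

    ```python
    # Sketch: reuse an English-only safety classifier for other languages via translation.

    def translate_to_english(text: str, source_lang: str) -> str:
        # Stand-in for a real machine-translation call; here it just passes text through.
        return text

    def english_safety_score(text: str) -> float:
        # Stand-in for a trained English safety classifier; a toy keyword heuristic.
        flagged = {"bomb", "poison"}
        words = {w.strip(".,!?").lower() for w in text.split()}
        return 1.0 if words & flagged else 0.0

    def is_harmful(text: str, lang: str, threshold: float = 0.5) -> bool:
        # Route non-English text through English, where safety data is plentiful.
        english = text if lang == "en" else translate_to_english(text, source_lang=lang)
        return english_safety_score(english) >= threshold

    print(is_harmful("How do I make a bomb?", lang="en"))  # True
    ```
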
  • In a recent study, researchers used a virtual team of attackers called 'Evil Geniuses' to test the safety of LLM-based agents. They found that these agents are less robust against malicious attacks, provide more complex harmful responses, and make inappropriate replies harder to detect.

    High Impact
  • Beth Barnes' nonprofit METR is partnering with major AI companies like OpenAI and Anthropic to develop safety tests for advanced AI systems, a move echoed by government initiatives. The focus is on assessing risks such as AI autonomy and self-replication, though there's acknowledgment that safety evaluations are still in early stages and cannot guarantee AI safety. METR's work is seen as pragmatic, despite concerns that current tests may not be sufficiently reliable to justify the rapid advancement of AI technologies.

  • Google DeepMind introduced the Frontier Safety Framework to address risks posed by future advanced AI models. This framework identifies critical capability levels (CCLs) for potentially harmful AI capabilities, evaluates models against these CCLs, and applies mitigation strategies when thresholds are reached.
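
    As a rough illustration of the framework's shape (not DeepMind's actual definitions), the sketch below defines hypothetical critical capability levels with thresholds, evaluates a model against each, and collects the mitigations that are triggered.

    ```python
    # Sketch: evaluate a model against hypothetical CCL thresholds and trigger mitigations.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class CriticalCapabilityLevel:
        name: str         # e.g. "autonomy", "cyber-offense" (illustrative names)
        threshold: float  # evaluation score at which the CCL is considered reached
        mitigation: str   # mitigation to apply once the threshold is hit

    def evaluate_model(run_eval: Callable[[str], float],
                       ccls: list[CriticalCapabilityLevel]) -> list[str]:
        """Return the mitigations triggered by a model's evaluation scores."""
        triggered = []
        for ccl in ccls:
            score = run_eval(ccl.name)
            if score >= ccl.threshold:
                triggered.append(f"{ccl.name}: apply {ccl.mitigation}")
        return triggered

    # Toy usage with made-up scores standing in for real dangerous-capability evals.
    ccls = [
        CriticalCapabilityLevel("autonomy", threshold=0.7, mitigation="deployment gating"),
        CriticalCapabilityLevel("cyber-offense", threshold=0.6, mitigation="weight security controls"),
    ]
    fake_scores = {"autonomy": 0.4, "cyber-offense": 0.75}
    print(evaluate_model(lambda name: fake_scores[name], ccls))
    ```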

  • OpenAI formed a Safety and Security Committee after announcing the training of its new foundation model. This committee will be tasked with issuing recommendations to the board about actions to take as model capabilities continue to improve.

  • Anthropic's Responsible Scaling Policy aims to prevent catastrophic AI safety failures by identifying high-risk capabilities, testing models regularly, and implementing strict safety standards, with a focus on continuous improvement and collaboration with industry and government.

    Friday, May 24, 2024
  • Anthropic researchers have unveiled a method to interpret the inner workings of its large language model, Claude Sonnet, by mapping out millions of features corresponding to a diverse array of concepts. This interpretability could lead to safer AI by allowing specific manipulations of these features to steer model behaviors. The study demonstrates a significant step in understanding and improving the safety mechanisms of AI language models.
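
    The sketch below illustrates the feature-steering idea in the loosest terms: given a dictionary of learned feature directions (for example, from a sparse autoencoder's decoder), a chosen feature can be amplified or suppressed by adding a scaled copy of its direction to the model's activations. The dimensions, random "decoder", feature index, and scale are all illustrative assumptions, not Anthropic's implementation.

    ```python
    # Sketch: steer an activation vector along one learned feature direction.
    import numpy as np

    d_model, n_features = 512, 4096
    rng = np.random.default_rng(0)

    # Stand-in for a learned dictionary of feature directions: one unit-norm row per feature.
    decoder = rng.normal(size=(n_features, d_model))
    decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)

    def steer(activations: np.ndarray, feature_id: int, scale: float) -> np.ndarray:
        """Shift activations along one feature's direction (positive scale = amplify)."""
        return activations + scale * decoder[feature_id]

    acts = rng.normal(size=(d_model,))            # stand-in residual-stream activation
    steered = steer(acts, feature_id=123, scale=4.0)
    ```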

  • OpenAI has announced the formation of a new Safety and Security Committee to oversee risk management for its projects and operations. The company recently began training its next frontier model. The committee will make safety and security recommendations to the full board of directors and will oversee processes and safeguards related to alignment research, protecting children, upholding election integrity, assessing societal impacts, and implementing security measures.

  • Jan Leike, a former OpenAI researcher who resigned over AI safety concerns, has joined Anthropic to lead a new "superalignment" team focusing on AI safety and security. Leike's team will address scalable oversight, weak-to-strong generalization, and automated alignment research.

  • Zico Kolter, a Professor at Carnegie Mellon University and expert in AI safety and robustness, has joined OpenAI's Board of Directors and its Safety and Security Committee. His extensive research in AI safety, alignment, and model robustness will enhance OpenAI's efforts to ensure AI benefits humanity.

  • MIT and other institutions have launched the AI Risk Repository, a comprehensive database of over 700 documented AI risks, to help organizations and researchers assess and mitigate evolving AI risks using a two-dimensional classification system and regularly updated information.

  • SB-1047 passed the California State Assembly by a 45-11 vote and now faces one more procedural state Senate vote before heading to the governor's desk. The bill requires AI model developers to implement a 'kill switch' that can be activated if a model starts introducing novel threats to public safety and security. It has been criticized for focusing on risks from an imagined future AI rather than present-day harms such as deepfakes and misinformation.

  • OpenAI and Anthropic have agreed to allow the US government early access to their major new AI models before public release to enhance safety evaluations as part of a memorandum with the US AI Safety Institute.

  • The article discusses the urgent need for global cooperation on AI safety as systems become increasingly powerful and potentially dangerous. Drawing parallels to the Pugwash Conferences that addressed nuclear weapons during the Cold War, it highlights the International Dialogues on AI Safety, a recent initiative that brings together leading AI scientists from both China and the West to foster dialogue and build consensus on AI safety as a global public good. The article emphasizes that rapid advances in AI capabilities pose existential risks, including the potential loss of human control and malicious uses of AI systems. To address these risks, the scientists involved have proposed three main recommendations:

    1. **Emergency Preparedness Agreements and Institutions**: Establish an international body to facilitate collaboration among AI safety authorities, helping states agree on the technical and institutional measures needed to prepare for advanced AI systems and ensuring that a minimal set of effective safety preparedness measures is adopted globally.

    2. **Safety Assurance Framework**: Require developers of frontier AI to demonstrate that their systems do not cross defined red lines, such as those that could lead to autonomous replication or the creation of weapons of mass destruction, backed by rigorous testing and evaluation as well as post-deployment monitoring.

    3. **Independent Global AI Safety and Verification Research**: Create Global AI Safety and Verification Funds to support independent research into AI safety, focused on developing verification methods that enable states to assess compliance with safety standards and frameworks.

    The piece concludes by underscoring the importance of a collective effort among scientists, states, and other stakeholders to navigate the challenges posed by AI. It stresses that the ethical responsibility of scientists, who understand the technology's implications, is vital to correcting the current imbalance in AI development, which is heavily shaped by profit motives and national security concerns, and it advocates a proactive approach to ensure that AI serves humanity's best interests while mitigating its risks.